Search CORE

252 research outputs found

Dynamic Data Structures for Document Collections and Graphs

Author: Munro J. Ian
Nekrich Yakov
Vitter Jeffrey Scott
Publication date: 19/03/2015

In the dynamic indexing problem, we must maintain a changing collection of text documents so that we can efficiently support insertions, deletions, and pattern matching queries. We are especially interested in developing efficient data structures that store and query the documents in compressed form. All previous compressed solutions to this problem rely on answering rank and select queries on a dynamic sequence of symbols. Because of the lower bound in [Fredman and Saks, 1989], answering rank queries presents a bottleneck in compressed dynamic indexing. In this paper we show how this lower bound can be circumvented using our new framework. We demonstrate that the gap between static and dynamic variants of the indexing problem can be almost closed. Our method is based on a novel framework for adding dynamism to static compressed data structures. Our framework also applies more generally to dynamizing other problems. We show, for example, how our framework can be applied to develop compressed representations of dynamic graphs and binary relations

arXiv.org e-Print Archive

CiteSeerX

Crossref

A Bulk-Parallel Priority Queue in External Memory with STXXL

Author: GS Brodal
J Singler
JS Vitter
L Arge
MC Pinotti
N Deo
P Sanders
PJ Varman
R Dementiev
Publication date: 01/01/2015

We propose the design and an implementation of a bulk-parallel external memory priority queue to take advantage of both shared-memory parallelism and high external memory transfer speeds to parallel disks. To achieve higher performance by decoupling item insertions and extractions, we offer two parallelization interfaces: one using "bulk" sequences, the other by defining "limit" items. In the design, we discuss how to parallelize insertions using multiple heaps, and how to calculate a dynamic prediction sequence to prefetch blocks and apply parallel multiway merge for extraction. Our experimental results show that in the selected benchmarks the priority queue reaches 75% of the full parallel I/O bandwidth of rotational disks and 65% of SSDs, or the speed of sorting in external memory when bounded by computation. Comment: extended version of SEA'15 conference paper

arXiv.org e-Print Archive

Crossref

KITopen

Competitive Parallel Disk Prefetching and Buffer Management

Author: Barve Rakesh
Kallahalla Mahesh
Varman Peter J.
Vitter Jeffrey Scott
Publication venue: 'Elsevier BV'
Publication date: 21/03/2011

We provide a competitive analysis framework for online prefetching and buffer management algorithms in parallel I/O systems, using a read-once model of block references. This has widespread applicability to key I/O-bound applications such as external merging and concurrent playback of multiple video streams. Two realistic lookahead models, global lookahead and local lookahead, are defined. Algorithms NOM and GREED based on these two forms of lookahead are analyzed for shared buffer and distributed buffer configurations, both of which occur frequently in existing systems. An important aspect of our work is that we show how to implement both models of lookahead in practice using the simple techniques of forecasting and flushing. Given a -disk parallel I/O system and a globally shared I/O buffer that can hold up to disk blocks, we derive a lower bound of on the competitive ratio of any deterministic online prefetching algorithm with lookahead. NOM is shown to match the lower bound using global -block lookahead. In contrast, using only local lookahead results in an competitive ratio. When the buffer is distributed into portions of blocks each, the algorithm GREED based on local lookahead is shown to be optimal, and NOM is within a constant factor of optimal. Thus we provide a theoretical basis for the intuition that global lookahead is more valuable for prefetching in the case of a shared buffer configuration, whereas it is enough to provide local lookahead in the case of the distributed configuration. Finally, we analyze the performance of these algorithms for reference strings generated by a uniformly random stochastic process and we show that they achieve the minimal expected number of I/Os. These results also give bounds on the worst-case expected performance of algorithms which employ randomization in the data layout.

KU ScholarWorks

Hierarchical Bin Buffering: Online Local Moments for Dynamic External Memory Arrays

Author: Chakrabarti K.
Daniel Lemire
Geffner S.
Gray J.
Lemire D.
Li B.-C.
Moerkotte G.
Owen Kaser
Schmidt R. R.
Scott D.
Vitter J. S.
Zhou F.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/05/2008

Local moments are used for local regression, to compute statistical measures such as sums, averages, and standard deviations, and to approximate probability distributions. We consider the case where the data source is a very large I/O array of size n and we want to compute the first N local moments, for some constant N. Without precomputation, this requires O(n) time. We develop a sequence of algorithms of increasing sophistication that use precomputation and additional buffer space to speed up queries. The simpler algorithms partition the I/O array into consecutive ranges called bins, and they are applicable not only to local-moment queries, but also to algebraic queries (MAX, AVERAGE, SUM, etc.). With N buffers of size \sqrt{n}, time complexity drops to O(\sqrt{n}). A more sophisticated approach uses hierarchical buffering and has a logarithmic time complexity, O(b \log_b n), when using N hierarchical buffers of size n/b. Using Overlapped Bin Buffering, we show that only a single buffer is needed, as with wavelet-based algorithms, but using much less storage. Applications exist in multidimensional and statistical databases over massive data sets, interactive image processing, and visualization.
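
The simplest bin-based idea in this abstract can be illustrated with a small in-memory sketch. It assumes a plain Python list in place of the external-memory array, handles only SUM queries rather than local moments, and uses a hypothetical class name `BinBufferedSum`; it is not the authors' implementation. One buffer holding about \sqrt{n} bin sums lets a range query touch O(\sqrt{n}) values.

```python
import math


class BinBufferedSum:
    """Range-sum queries over an array using one buffer of per-bin sums."""

    def __init__(self, data):
        self.data = list(data)
        n = len(self.data)
        # Bin width of about sqrt(n), which gives about sqrt(n) bins.
        self.width = max(1, math.isqrt(n))
        self.bin_sums = [
            sum(self.data[start:start + self.width])
            for start in range(0, n, self.width)
        ]

    def range_sum(self, lo, hi):
        """Return sum(data[lo:hi]) while reading O(sqrt(n)) stored values."""
        if not 0 <= lo <= hi <= len(self.data):
            raise IndexError("range out of bounds")
        total = 0
        i = lo
        while i < hi:
            if i % self.width == 0 and i + self.width <= hi:
                # A whole bin lies inside the range: use its buffered sum.
                total += self.bin_sums[i // self.width]
                i += self.width
            else:
                # Partial bin at either end: read the element directly.
                total += self.data[i]
                i += 1
        return total

    def update(self, index, value):
        """Overwrite one element and patch the single affected bin sum."""
        self.bin_sums[index // self.width] += value - self.data[index]
        self.data[index] = value
```

An update changes one array cell and one buffered sum, which is what makes this kind of buffer cheap to maintain when the array changes.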

arXiv.org e-Print Archive

R-libre

Crossref

Cylindrical Static and Kinetic Binary Space Partitions

Author: Agarwal Pankaj K.
Guibas Leonidas J.
Murali T. M.
Vitter Jeffrey Scott
Publication venue: 'Elsevier BV'
Publication date: 01/01/2000

P. K. Agarwal, L. Guibas, T. M. Murali, and J. S. Vitter. “Cylindrical Static and Kinetic Binary Space Partitions,” Computational Geometry, 16(2), 2000, 103–127. An extended abstract appears in Proceedings of the 13th Annual ACM Symposium on Computational Geometry (SCG ’97), Nice, France, June 1997, 39–48

Elsevier - Publisher Connector

KU ScholarWorks

Cylindrical Static and Kinetic Binary Space Partitions

Author: Agarwal Pankaj K.
Guibas Leonidas J.
Murali T. M.
Vitter Jeffrey Scott
Publication venue: 'Elsevier BV'
Publication date: 21/03/2011

KU ScholarWorks

Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array

Author: D Okanohara
J Fischer
J Kärkkäinen
J Sirén
JI Munro
JS Vitter
K Sadakane
P Ferragina
R Dementiev
T Beller
T Kasai
U Manber
W Hon
W Szpankowski
Publication date: 01/01/2016

The longest common prefix (LCP) array is a versatile auxiliary data structure in indexed string matching. It can be used to speed up searching using the suffix array (SA) and provides an implicit representation of the topology of an underlying suffix tree. The LCP array of a string of length n can be represented as an array of length n words, or, in the presence of the SA, as a bit vector of 2n bits plus asymptotically negligible support data structures. External memory construction algorithms for the LCP array have been proposed, but those proposed so far have a space requirement of O(n) words (i.e. O(n \log n) bits) in external memory. This space requirement is in some practical cases prohibitively expensive. We present an external memory algorithm for constructing the 2n bit version of the LCP array which uses O(n \log \sigma) bits of additional space in external memory when given a (compressed) BWT with alphabet size \sigma and a sampled inverse suffix array at sampling rate O(\log n). This is often a significant space gain in practice where \sigma is usually much smaller than n or even constant. We also consider the case of computing succinct LCP arrays for circular strings.
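
To make the 2n-bit representation concrete, here is a minimal in-memory sketch. It assumes the standard unary encoding of the permuted LCP (PLCP) array, in which PLCP[i] + i never decreases and the j-th 1-bit therefore sits at position PLCP[j] + 2j. The function names are hypothetical, the suffix array is built naively, and nothing here reflects the external-memory construction the abstract describes.

```python
def plcp_array(text):
    """PLCP[i] = LCP of suffix i with its predecessor in suffix-array order."""
    n = len(text)
    # Naive suffix array: fine for a demo, far too slow for real inputs.
    sa = sorted(range(n), key=lambda i: text[i:])
    plcp = [0] * n
    for rank in range(1, n):
        a, b = sa[rank - 1], sa[rank]
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        plcp[b] = k
    return plcp


def encode_plcp(plcp):
    """Unary-encode the non-decreasing values PLCP[i] + i as a bit list."""
    bits = []
    previous = 0
    for i, value in enumerate(plcp):
        current = value + i
        bits.extend([0] * (current - previous))  # gap written as 0-bits
        bits.append(1)                           # one 1-bit per text position
        previous = current
    return bits


def decode_plcp(bits):
    """Recover PLCP[j] as (position of the j-th 1-bit) - 2j."""
    one_positions = [pos for pos, bit in enumerate(bits) if bit == 1]
    return [pos - 2 * j for j, pos in enumerate(one_positions)]
```

A practical structure would replace the linear scan in `decode_plcp` with a constant-time select query on the bit vector; those are the support data structures the abstract calls asymptotically negligible.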

arXiv.org e-Print Archive

Crossref

MPG.PuRe

Random sampling with a reservoir

Author: ERNVALL J.
FAN C. T.
FELLER W.
Jeffrey S. Vitter
KNUTH D.E.
VITTER J.S.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

When Random Sampling Preserves Privacy

Author: C. Dwork
J. Vitter
O. Goldreich
S. Chawla
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2006

Abstract. Many organizations such as the U.S. Census publicly release samples of data that they collect about private citizens. These datasets are first anonymized using various techniques and then a small sample is released so as to enable “do-it-yourself” calculations. This paper investigates the privacy of the second step of this process: sampling. We observe that rare values – values that occur with low frequency in the table – can be problematic from a privacy perspective. To our knowledge, this is the first work that quantitatively examines the relationship between the number of rare values in a table and the privacy in a released random sample. If we require ɛ-privacy (where the larger ɛ is, the worse the privacy guarantee) with probability at least 1 − δ, we say that a value is rare if it occurs in at most Õ(1/ɛ) rows of the table (ignoring log factors). If there are no rare values, then we establish a direct connection between sample size that is safe to release and privacy. Specifically, if we select each row of the table with probability at most ɛ then the sample is O(ɛ)-private with high probability. In the case that there are t rare values, then the sample is Õ(ɛδ/t)-private with probability at least 1 − δ.
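
The two ingredients named in this abstract, flagging rare values and sampling each row independently, can be sketched as follows. This is an illustration only: the function names are hypothetical, the rarity threshold is taken as exactly 1/ɛ with the hidden constants and log factors dropped, and the code makes no privacy guarantee of its own.

```python
import random
from collections import Counter


def rare_values(column, eps):
    """Values occurring in at most 1/eps rows (log factors ignored); eps > 0."""
    threshold = 1.0 / eps
    counts = Counter(column)
    return {value for value, count in counts.items() if count <= threshold}


def bernoulli_sample(rows, eps, rng=None):
    """Keep each row independently with probability eps."""
    rng = rng or random.Random()
    return [row for row in rows if rng.random() < eps]
```

Passing a seeded `random.Random` as `rng` makes a run reproducible, which helps when comparing samples drawn at different values of ɛ.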

CiteSeerX

Crossref

Improving Cache Utilization of Linear Relaxation Methods: Theory and Practice

Author: C. Leiserson
J. Vitter
K. Davis
M. Wolfe
R.W. Floyd
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/1999

Crossref